27 research outputs found

    Robust detection of periodic time series measured from biological systems

    BACKGROUND: Periodic phenomena are widespread in biology. The problem of finding periodicity in biological time series can be viewed as a multiple hypothesis testing problem on the spectral content of a given time series. The exact noise characteristics are unknown in many bioinformatics applications. Furthermore, the observed time series can exhibit other non-idealities, such as outliers, short length and distortion from the original wave form. Hence, the computational methods should preferably be robust against such anomalies in the data. RESULTS: We propose a general-purpose robust testing procedure for finding periodic sequences in multiple time series data. The proposed method is based on a robust spectral estimator which is incorporated into the hypothesis testing framework using a so-called g-statistic together with correction for multiple testing. The resulting testing procedure is insensitive to heavy contamination by outliers, missing values, short time series and nonlinear distortions, and is completely insensitive to any monotone nonlinear distortion. The performance of the method is evaluated by extensive simulations. In addition, we compare the proposed method with another recent statistical signal detection estimator that uses Fisher's test, based on the Gaussian noise assumption. The results demonstrate that the proposed robust method provides remarkably better robustness properties. Moreover, the performance of the proposed method is preferable also in the standard Gaussian case. We validate the performance of the proposed method on real data, on which the method performs very favorably. CONCLUSION: As the time series measured from biological systems are usually short and prone to contain different kinds of non-idealities, we are very optimistic about the multitude of possible applications for our proposed robust statistical periodicity detection method.
AVAILABILITY: The presented methods have been implemented in Matlab and in R. Codes are available on request. Supplementary material is available at:
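    For orientation, the classical (non-robust) g-statistic used in Fisher's test, which the abstract names as the comparison method, can be sketched as follows. This is a minimal illustration assuming evenly sampled data and Gaussian white noise under the null hypothesis; it is not the authors' robust procedure, which replaces the periodogram with a robust spectral estimator and adds a multiple-testing correction (neither is shown). The function names are hypothetical.

```python
import cmath
import math


def periodogram(x):
    """Periodogram at the Fourier frequencies k/N, k = 1..floor((N-1)/2)."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    m = (n - 1) // 2
    ordinates = []
    for k in range(1, m + 1):
        s = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                for t, v in enumerate(centered))
        ordinates.append(abs(s) ** 2 / n)
    return ordinates


def fisher_g(x):
    """g-statistic: largest periodogram ordinate divided by their sum."""
    p = periodogram(x)
    return max(p) / sum(p)


def fisher_g_pvalue(g, m):
    """Exact p-value of Fisher's test for m periodogram ordinates,
    valid under the Gaussian white-noise null hypothesis."""
    upper = min(int(math.floor(1.0 / g)), m)
    total = 0.0
    for j in range(1, upper + 1):
        total += (-1) ** (j - 1) * math.comb(m, j) * (1.0 - j * g) ** (m - 1)
    # Clip to [0, 1] to absorb rounding error in the alternating sum.
    return min(max(total, 0.0), 1.0)


def fisher_test(x):
    """Return (g, p-value) for one evenly sampled time series."""
    m = (len(x) - 1) // 2
    g = fisher_g(x)
    return g, fisher_g_pvalue(g, m)
```

    A clean sinusoid gives g close to 1 and a p-value close to 0, while a series dominated by noise gives g near 1/m. The alternating sum in the p-value can lose precision for long series, so this form is best suited to the short series typical of the setting described above.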

    Simulation of microarray data with realistic characteristics

    BACKGROUND: Microarray technologies have become common tools in biological research. As a result, a need for effective computational methods for data analysis has emerged. Numerous different algorithms have been proposed for analyzing the data. However, an objective evaluation of the proposed algorithms is not possible due to the lack of biological ground truth information. To overcome this fundamental problem, the use of simulated microarray data for algorithm validation has been proposed. RESULTS: We present a microarray simulation model which can be used to validate different kinds of data analysis algorithms. The proposed model is unique in the sense that it includes all the steps that affect the quality of real microarray data. These steps include the simulation of biological ground truth data, applying biological and measurement technology specific error models, and finally simulating the microarray slide manufacturing and hybridization. After all these steps are taken into account, the simulated data has realistic biological and statistical characteristics. The applicability of the proposed model is demonstrated by several examples. CONCLUSION: The proposed microarray simulation model is modular and can be used in different kinds of applications. It includes several error models that have been proposed earlier and it can be used with different types of input data. The model can be used to simulate both spotted two-channel and oligonucleotide-based single-channel microarrays. All this makes the model a valuable tool, for example, in the validation of data analysis algorithms.

    Computational Methods for Estimation of Cell Cycle Phase Distributions of Yeast Cells

    Two computational methods for estimating the cell cycle phase distribution of a budding yeast (Saccharomyces cerevisiae) cell population are presented. The first one is a nonparametric method that is based on the analysis of DNA content in the individual cells of the population. The DNA content is measured with a fluorescence-activated cell sorter (FACS). The second method is based on budding index analysis. An automated image analysis method is presented for the task of detecting the cells and buds. The proposed methods can be used to obtain quantitative information on the cell cycle phase distribution of a budding yeast S. cerevisiae population. They therefore provide a solid basis for obtaining the complementary information needed in deconvolution of gene expression data. As a case study, both methods are tested with data that were obtained in a time series experiment with S. cerevisiae. The details of the time series experiment as well as the image and FACS data obtained in the experiment can be found in the online additional material at http://www.cs.tut.fi/sgn/csb/yeastdistrib/

    Robust regression for periodicity detection in non-uniformly sampled time-course gene expression data

    BACKGROUND: In practice many biological time series measurements, including gene microarrays, are conducted at time points that seem to be interesting in the biologist's opinion and not necessarily at fixed time intervals. In many circumstances we are interested in finding targets that are expressed periodically. To tackle the problems of uneven sampling and unknown type of noise in periodicity detection, we propose to use robust regression. METHODS: The aim of this paper is to develop a general framework for robust periodicity detection and review and rank different approaches by means of simulations. We also show the results for some real measurement data. RESULTS: The simulation results clearly show that when the sampling of time series gets more and more uneven, the methods that assume even sampling become unusable. We find that M-estimation provides a good compromise between robustness and computational efficiency. CONCLUSION: Since uneven sampling occurs often in biological measurements, the robust methods developed in this paper are expected to have many uses. The regression based formulation of the periodicity detection problem easily adapts to non-uniform sampling. Using robust regression helps to reject inconsistently behaving data points. AVAILABILITY: The implementations are currently available for Matlab and will be made available for the users of R as well. More information can be found in the web-supplement [1].
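    The regression formulation can be illustrated with a small sketch: a sinusoid of known period is fitted to unevenly sampled data by M-estimation, here with Huber weights computed through iteratively reweighted least squares. The choice of the Huber function, the tuning constant 1.345, the scale estimate based on the median absolute residual, and all function names are assumptions made for illustration; the paper's own implementation may differ.

```python
import math
from statistics import median


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in range(2, -1, -1):
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x


def fit_sinusoid_huber(times, values, period, k=1.345, iters=30):
    """Fit y = a*cos(w*t) + b*sin(w*t) + c at arbitrary sampling times.

    Returns [a, b, c]. With iters=1 the result is the ordinary
    least-squares fit, because the initial weights are all 1.
    """
    w = 2.0 * math.pi / period
    rows = [[math.cos(w * t), math.sin(w * t), 1.0] for t in times]
    weights = [1.0] * len(times)
    beta = [0.0, 0.0, 0.0]
    for _ in range(iters):
        # Weighted normal equations (A^T W A) beta = A^T W y.
        ata = [[sum(wt * r[i] * r[j] for wt, r in zip(weights, rows))
                for j in range(3)] for i in range(3)]
        aty = [sum(wt * r[i] * y for wt, r, y in zip(weights, rows, values))
               for i in range(3)]
        beta = solve3(ata, aty)
        resid = [y - sum(bi * ri for bi, ri in zip(beta, r))
                 for r, y in zip(rows, values)]
        # Robust scale from the median absolute residual (floored to avoid 0).
        scale = max(median(abs(e) for e in resid) / 0.6745, 1e-12)
        # Huber weights: 1 for small residuals, decreasing for large ones.
        weights = [1.0 if abs(e) <= k * scale else k * scale / abs(e)
                   for e in resid]
    return beta
```

    The fitted amplitude, sqrt(a^2 + b^2), evaluated over a grid of candidate periods gives a regression-based analogue of a spectral estimate that needs no assumption of even sampling, and the down-weighting step is what lets inconsistent data points be largely ignored.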

    Feature selection in omics prediction problems using cat scores and false non-discovery rate control

    We propose an effective framework for high-dimensional linear discriminant analysis (LDA) based on three key elements: James-Stein shrinkage for learning prediction rules, feature ranking by correlation-adjusted t-scores (cat scores), and feature selection by thresholding and controlling false non-discovery rates (FNDR). Relative to competing LDA approaches, our algorithm is computationally inexpensive and makes high-dimensional LDA analysis practical. Furthermore, we show on four experimental data sets and by comparing with the “higher criticism” approach that feature selection by FNDR control is very effective not only for LDA but also for diagonal discriminant analysis. The proposed shrinkage discriminant and variable selection procedure is implemented in the R package “sda”, available from the R repository CRAN.

    Disambiguate: An open-source application for disambiguating two species in next generation sequencing data from grafted samples [version 2; referees: 3 approved]

    Grafting of cell lines and primary tumours is a crucial step in the drug development process between cell line studies and clinical trials. Disambiguate is a program for computationally separating the sequencing reads of two species derived from grafted samples. Disambiguate operates on DNA or RNA-seq alignments to the two species and separates the components at very high sensitivity and specificity as illustrated in artificially mixed human-mouse samples. This allows for maximum recovery of data from target tumours for more accurate variant calling and gene expression quantification. Given that no general use open source algorithm accessible to the bioinformatics community exists for the purposes of separating the two species data, the proposed Disambiguate tool presents a novel approach and improvement to performing sequence analysis of grafted samples. Both Python and C++ implementations are available and they are integrated into several open and closed source pipelines. Disambiguate is open source and is freely available at https://github.com/AstraZeneca-NGS/disambiguate
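    The idea of separating reads by comparing their alignments to the two species can be sketched in a few lines. The rule below (assign each read to the species with the higher alignment score, and call ties ambiguous) is an assumption made for illustration only; the actual Disambiguate tool may use different criteria, and every name here is hypothetical.

```python
def assign_species(score_a, score_b, name_a="human", name_b="mouse"):
    """Classify one read from its alignment scores against two genomes.

    A score of None means the read did not align to that genome.
    """
    if score_a is None and score_b is None:
        return "unaligned"
    if score_b is None:
        return name_a
    if score_a is None:
        return name_b
    if score_a > score_b:
        return name_a
    if score_b > score_a:
        return name_b
    return "ambiguous"


def disambiguate(reads, name_a="human", name_b="mouse"):
    """Group read identifiers by assigned species.

    `reads` maps a read identifier to a (score_a, score_b) pair.
    """
    buckets = {name_a: [], name_b: [], "ambiguous": [], "unaligned": []}
    for read_id, (score_a, score_b) in reads.items():
        buckets[assign_species(score_a, score_b, name_a, name_b)].append(read_id)
    return buckets
```

    Keeping ambiguous reads in their own group, rather than forcing a call, is one way to trade a little recovery for specificity in downstream variant calling.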

    Prioritisation of structural variant calls in cancer genomes (distributed under Creative Commons CC-BY 4.0)

    Sensitivity of short read DNA-sequencing for gene fusion detection is improving, but is hampered by a significant amount of noise in the form of uninteresting or false-positive hits in the data. In this paper we describe a tiered prioritisation approach to extract high impact gene fusion events from existing structural variant calls. Using cell line and patient DNA sequence data, we improve the annotation and interpretation of structural variant calls to best highlight likely cancer-driving fusions. We also considerably improve the automated visualisation of high impact structural variants to highlight the effects of the variants on the resulting transcripts. The resulting framework makes clinically actionable structural variants considerably easier to detect.

    Robust regression for periodicity detection in non-uniformly sampled time-course gene expression data-1

    Copyright information: Taken from "Robust regression for periodicity detection in non-uniformly sampled time-course gene expression data", http://www.biomedcentral.com/1471-2105/8/233, BMC Bioinformatics 2007;8:233. Published online 2 Jul 2007. PMCID: PMC1934414. ...s time in hours and the first time point corresponds to 8:40 am. The approximately 24-hour cycle can be seen well. The figure legends show the gene names corresponding to the plotted time series.

    Robust regression for periodicity detection in non-uniformly sampled time-course gene expression data-0

    Copyright information: Taken from "Robust regression for periodicity detection in non-uniformly sampled time-course gene expression data", http://www.biomedcentral.com/1471-2105/8/233, BMC Bioinformatics 2007;8:233. Published online 2 Jul 2007. PMCID: PMC1934414. ...led according to the experimental mussel data. The sampling of the second time series (c) is an artificially deteriorated version of the first one. The corresponding spectral estimates, (b) and (d), include the ideal periodogram (Ideal periodogram), as if the time series was sampled uniformly and had no added noise, the periodogram of the samples (Periodogram), ignoring time indices, and the M-estimate (Robust (M) estimator).